Stats 405 Final Project
Introduction
My objective for this project is to create a MySQL database with web-scraped Historical Movies Box Office Data. For the scope of this project, the database will be on my local system, but this can be created on the cloud server as well utilizing the same methods.
After setting up the database, I will clean the tables and pull the data. Using the extracted data, I will also perform some basic data analysis.
This project will consist of several components taught in this class such as web-scraping HTML tables, storing data in MySQL database, and cleaning and analyzing data using R packages/functions.
Gathering Data
First, we need to pull Box Office data from publicly available online sources. There is a website called TheNumbers.com which is the leading online box-office reporting service. Website link is below: The Numbers
The website provides all-time Opening/Total-Gross Box Office and Theater Data for all the movies both domestically and internationally. We will focus on domestic data (U.S. + Canada) for this project. We will use the following R codes to go into web page of each movie’s data summary and pull corresponding data into R. All data will be cleaned to take care of missing data and the structures.
##########################################################
## Get the master movie list from Website with Parallel ##
##########################################################
library(doParallel)
library(foreach)
cores=detectCores()
cl <- makeCluster(cores[1]-1)
# cl = makePSOCKcluster(2)
registerDoParallel(cl)
getDoParWorkers()
# There are two ways of storing data on the website: Letter with less movies are not grouped, movies listed on a single page
# Letter with a lot of movies are grouped using second letters.
UngroupedPages <- c(1:10,c("Q","X","Z"))
GroupedPages <- setdiff(LETTERS, c("Q","X","Z"))
# For Ungrouped Pages
master.table = foreach(Letter = UngroupedPages, .packages = c('foreach','rvest'), .combine = rbind) %dopar% {
original.url = paste0("https://www.the-numbers.com/movies/letter/",Letter)
length = length(read_html(original.url)%>% html_nodes("td") %>% html_text())
url = read_html(original.url)%>% html_nodes("a") %>% html_attr("href")
url = url[grepl(paste0("movie/",Letter),url)&grepl("#tab=summary",url)]
movieLinks = url
movieLinks2 = gsub("summary","box-office",movieLinks)
numOfMovies = length(movieLinks)
foreach(i = 1:numOfMovies, .packages = c('foreach','rvest'), .combine = rbind) %dopar% {
url2 = paste0("https://www.the-numbers.com", movieLinks[i])
title = read_html(url2) %>% html_nodes("h1") %>% html_text() %>% .[1]
url3 = paste0("https://www.the-numbers.com", movieLinks2[i])
info = tryCatch(read_html(url3) %>% html_nodes("table") %>% .[[4]] %>% html_table() %>% data.frame(), error = function(e){NA})
info2 = tryCatch(read_html(url3) %>% html_nodes("div#box_office_chart") %>% html_nodes("table") %>% html_table() %>% .[[1]] %>% data.frame(), error = function(e){NA})
info3 = tryCatch(read_html(url3) %>% html_nodes("div#page_filling_chart") %>% html_nodes("table") %>% html_table() %>% .[[1]] %>% data.frame(), error = function(e){NA})
if (!is.null(nrow(info)) & !is.null(nrow(info2)) & !is.null(nrow(info3))) {
if (!is.null(info2$Total.Gross[1])) {
movie = data.frame(title = title, Release.Date = info2$Date[1], Genre = ifelse("Genre:" %in% info$X1, info$X2[info$X1 == "Genre:"], ""), Running.Time = ifelse("Running Time:" %in% info$X1, info$X2[info$X1 == "Running Time:"], ""),
Box.Office.Open = info2$Total.Gross[1], Theaters.Open = info2$Theaters[1], Box.Office.Total.Gross = info3$Amount[1], stringsAsFactors = FALSE)
movie
}
}
}
}
# For Grouped Pages
master.table2 = foreach(Letter = GroupedPages, .packages = c('foreach','rvest'), .combine = rbind) %dopar% {
original.url = paste0("https://www.the-numbers.com/movies/letter/",Letter)
length = length(read_html(original.url)%>% html_nodes("td") %>% html_text())
indexLinks = read_html(original.url)%>% html_nodes("a") %>% html_attr("href")
indexLinks = indexLinks[grepl(paste0("/movies/letter/",Letter,"/"),indexLinks)]
numOfPages = length(indexLinks)
foreach(Page = 1:numOfPages, .packages = c('foreach','rvest'), .combine = rbind) %dopar% {
url = paste0("https://www.the-numbers.com", indexLinks[Page])
movieLinks = read_html(url)%>% html_nodes("a") %>% html_attr("href")
movieLinks = movieLinks[grepl(paste0("movie/",Letter),movieLinks)&grepl("#tab=summary",movieLinks)]
movieLinks2 = gsub("summary","box-office",movieLinks)
numOfMovies = length(movieLinks)
foreach(i = 1:numOfMovies, .packages = c('foreach','rvest'), .combine = rbind) %dopar% {
url2 = paste0("https://www.the-numbers.com", movieLinks[i])
title = read_html(url2) %>% html_nodes("h1") %>% html_text() %>% .[1]
url3 = paste0("https://www.the-numbers.com", movieLinks2[i])
info = tryCatch(read_html(url3) %>% html_nodes("table") %>% .[[4]] %>% html_table() %>% data.frame(), error = function(e){NA})
info2 = tryCatch(read_html(url3) %>% html_nodes("div#box_office_chart") %>% html_nodes("table") %>% html_table() %>% .[[1]] %>% data.frame(), error = function(e){NA})
info3 = tryCatch(read_html(url3) %>% html_nodes("div#page_filling_chart") %>% html_nodes("table") %>% html_table() %>% .[[1]] %>% data.frame(), error = function(e){NA})
if (!is.null(nrow(info)) & !is.null(nrow(info2)) & !is.null(nrow(info3))) {
if (!is.null(info2$Total.Gross[1])) {
movie = data.frame(title = title, Release.Date = info2$Date[1], Genre = ifelse("Genre:" %in% info$X1, info$X2[info$X1 == "Genre:"], ""), Running.Time = ifelse("Running Time:" %in% info$X1, info$X2[info$X1 == "Running Time:"], ""),
Box.Office.Open = info2$Total.Gross[1], Theaters.Open = info2$Theaters[1], Box.Office.Total.Gross = info3$Amount[1], stringsAsFactors = FALSE)
movie
}
}
}
}
}
stopCluster(cl)
master.table = rbind(master.table, master.table2)
# Since this web scraping takes a long time, we will save this data and load it before sending the data to SQL server.
write.csv(master.table, "MovieData.csv", row.names = F)
movies <- fread("MovieData.csv")
movies <- movies[order(movies$title),]Storing Data
With the collected Box Office data, now we will set up a MySQL database so we can easily access the data from a SQL database without the time-consuming data pulling and cleaning each time.
We can run SQL commands in R. If you do not already have MySQL on your computer, you can simply download the community version from MySQL website:
From the installation, there is a software called MySQL Workbench which has visual modeling and lets you write and execute SQL scripts as well as see the data tables on the server.
Once you have created an account, you can access your MySQL server through R using the gbvvvv package. The code below creates tables and sets keys for Box Office data.
library(RMySQL)
thenumbers <- dbConnect(MySQL(), user='root', password="password", host='localhost')
dbSendQuery(thenumbers, "CREATE DATABASE thenumbers")
dbSendQuery(thenumbers, "USE thenumbers")
dbWriteTable(thenumbers, value = movies, name = "movies", append=FALSE, overwrite=TRUE)Accessing/Cleaning and Analyzing Data
We can now query the table we created in the MySQL database into R
movies <- as.data.frame(dbGetQuery(thenumbers, "SELECT * FROM movies"))# for the time being, I can pull the data directly from the folder
setwd("C:/Users/luxur/Desktop/DAN's Folder/STATS 405/Final")
movies <- fread("MovieData.csv")
movies <- movies[order(movies$title),]- Cleaning Data
I will clean the data and go over some example analysis we can perform using the stored box office data.
# remove rows that contain empty data
movies <- unique(movies)
movies <- subset(movies, movies$Genre != "" & movies$Running.Time != "")
# Change variables to proper types of data
movies$Release.Date = as.Date(movies$Release.Date, format = "%b %d, %Y")
movies$Running.Time = as.numeric(gsub(" minutes","",movies$Running.Time))
movies$Theaters.Open = as.numeric(gsub("[$,]","",movies$Theaters.Open))
movies$Box.Office.Open = as.numeric(gsub("[$,]","",movies$Box.Office.Open))
movies$Box.Office.Total.Gross = as.numeric(gsub("[$,]","",movies$Box.Office.Total.Gross))- Correlation Plot
cor <- cor(movies[,c("Running.Time","Theaters.Open","Box.Office.Open","Box.Office.Total.Gross")])
corrplot.mixed(cor)As expected, the correlation plot shows Box Office numbers are correlated with number of theaters open for the movies. Now I will make some visuals to show relationships between variables.
- Analyzing Data
agg <- aggregate(movies[,c("Running.Time","Theaters.Open","Box.Office.Open","Box.Office.Total.Gross")], by = list(movies$Genre), FUN = mean)
names(agg)[1] = "Genre"
p <- plot_ly(agg, x = ~Genre, y = ~Theaters.Open, type = 'bar', name = 'Theaters.Open') %>%
add_trace(y = ~Box.Office.Open, name = 'Box.Office.Open') %>%
add_trace(y = ~Box.Office.Total.Gross, name = 'Box.Office.Total.Gross') %>%
layout(title = "Average Box Office Numbers by Genre", yaxis = list(title = 'Box Office Numbers'), barmode = 'group')
pOn Average, Action and Adventure have the highest Opening and Gross Box Office revenues followed by Comedy and Drama. In theory, people generally love Action and Adventure movies which support this graph. Interestingly, Musical Genre has high Gross Box Office value even if its opening value wasn’t as highly ranked. We can say Musical movies grow popular over time of movies shown in theaters.
movies$Running.Time.Category[movies$Running.Time <= 60] = "Less than 1 Hr"
movies$Running.Time.Category[movies$Running.Time <= 90 & movies$Running.Time > 60] = "1 Hr - 1 Hr 30 Min"
movies$Running.Time.Category[movies$Running.Time <= 120 & movies$Running.Time > 90] = "1 Hr 30 Min - 2 Hrs"
movies$Running.Time.Category[movies$Running.Time <= 180 & movies$Running.Time > 120] = "2 Hrs - 3 Hrs"
movies$Running.Time.Category[movies$Running.Time > 180] = "Longer than 3 Hrs"
agg <- aggregate(movies[,c("Theaters.Open","Box.Office.Open","Box.Office.Total.Gross")], by = list(movies$Running.Time.Category), FUN = mean)
names(agg)[1] = "Running.Time"
p <- plot_ly(agg, x = ~Running.Time, y = ~Theaters.Open, type = 'bar', name = 'Theaters.Open') %>%
add_trace(y = ~Box.Office.Open, name = 'Box.Office.Open') %>%
add_trace(y = ~Box.Office.Total.Gross, name = 'Box.Office.Total.Gross') %>%
layout(title = "Average Box Office Numbers by Running Time", yaxis = list(title = 'Box Office Numbers'), barmode = "group")
pOn Average, movies with longer than 2 Hrs are popular.
agg <- aggregate(movies[,c("Theaters.Open","Box.Office.Open","Box.Office.Total.Gross")], by = list(substr(as.character(movies$Release.Date),1,4)), FUN = mean)
names(agg)[1] = "Release.Date"
p <- plot_ly(agg, x = ~Release.Date, y = ~Theaters.Open, mode = 'lines', type = 'scatter', name = 'Theaters.Open') %>%
add_trace(y = ~Box.Office.Open, name = 'Box.Office.Open') %>%
add_trace(y = ~Box.Office.Total.Gross, name = 'Box.Office.Total.Gross') %>%
layout(title = "Average Box Office Numbers by Release Year", yaxis = list(title = 'Box Office Numbers'))
pOn Average, Box Office revenues are not necessarily increasing over time probably because there are a lot more movies nowadays which fail to generate renevues.
agg <- aggregate(movies[,c("Theaters.Open","Box.Office.Open","Box.Office.Total.Gross")], by = list(substr(as.character(movies$Release.Date),1,4)), FUN = sum)
names(agg)[1] = "Release.Date"
p <- plot_ly(agg, x = ~Release.Date, y = ~Theaters.Open, mode = 'lines', type = 'scatter', name = 'Theaters.Open') %>%
add_trace(y = ~Box.Office.Open, name = 'Box.Office.Open') %>%
add_trace(y = ~Box.Office.Total.Gross, name = 'Box.Office.Total.Gross') %>%
layout(title = "Total Box Office Numbers by Release Year", yaxis = list(title = 'Box Office Numbers'))
pOn Total values, however, Box Office revenues are increasing over time because there are a lot more movies nowadays.
This data have more potential for analysis. Future projects could include joining the box office data with other publicly available data like Rotten Tomato scores or IMDB scores.